Project 2: Breakout Strategy¶

The other packages that we're importing are helper, project_helper, and project_tests. These are custom packages built to solve the problems. The helper and project_helper modules contain utility functions and graph functions. The project_tests module contains the unit tests for all the problems.

Install Packages¶

In [26]:
import sys
#!{sys.executable} -m pip install -r requirements.txt

Load Packages¶

In [27]:
import pandas as pd
import numpy as np
import helper
import project_helper
import project_tests
import warnings
warnings.filterwarnings('ignore')

Market Data¶

Load Data¶

In [28]:
df_original = pd.read_csv('eod-quotemedia.csv', parse_dates=['date'], index_col=False)

# Add TB sector to the market
df = df_original
df = pd.concat([df] + project_helper.generate_tb_sector(df[df['ticker'] == 'AAPL']['date']), ignore_index=True)

close = df.reset_index().pivot(index='date', columns='ticker', values='adj_close')
high = df.reset_index().pivot(index='date', columns='ticker', values='adj_high')
low = df.reset_index().pivot(index='date', columns='ticker', values='adj_low')

print('Loaded Data')
Loaded Data

View Data¶

To see what one of these 2-d matrices looks like, let's take a look at the closing prices matrix.
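
As a minimal sketch of the shape of these matrices, the toy example below pivots a small long-format table the same way the loading cell does. The tickers 'AAA' and 'BBB' and their prices are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Hypothetical long-format quotes: one row per (date, ticker) pair.
toy_quotes = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02']),
    'ticker': ['AAA', 'BBB', 'AAA', 'BBB'],
    'adj_close': [10.0, 20.0, 11.0, 19.0],
})

# Pivot to a 2-d matrix: one row per date, one column per ticker.
toy_close = toy_quotes.pivot(index='date', columns='ticker', values='adj_close')
print(toy_close)
```

Each cell of the resulting matrix holds the adjusted close of one ticker on one date, which is the layout the rest of the project assumes for close, high, and low.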

Stock Example¶

Let's see what a single stock looks like from the closing prices. For this example and future display examples in this project, we'll use Apple's stock (AAPL). If we tried to graph all the stocks, it would be too much information.

In [29]:
apple_ticker = 'AAPL'
project_helper.plot_stock(close[apple_ticker], '{} Stock'.format(apple_ticker))

The Alpha Research Process¶

In this project you will code and evaluate a "breakout" signal. It is important to understand where these steps fit in the alpha research workflow. The signal-to-noise ratio in trading signals is very low and, as such, it is very easy to fall into the trap of overfitting to noise. It is therefore inadvisable to jump right into signal coding. To help mitigate overfitting, it is best to start with a general observation and hypothesis; i.e., you should be able to answer the following question before you touch any data:

What feature of markets or investor behaviour would lead to a persistent anomaly that my signal will try to use?

Ideally the assumptions behind the hypothesis will be testable before you actually code and evaluate the signal itself. The workflow therefore is as follows:

[Figure: the alpha research workflow]

In this project, we assume that the first three steps are done ("observe & research", "form hypothesis", "validate hypothesis"). The hypothesis you'll be using for this project is the following:

  • In the absence of news or significant investor trading interest, stocks oscillate in a range.
  • Traders seek to capitalize on this range-bound behaviour periodically by selling/shorting at the top of the range and buying/covering at the bottom of the range. This behaviour reinforces the existence of the range.
  • When stocks break out of the range, due to, e.g., a significant news release or from market pressure from a large investor:
    • the liquidity traders who have been providing liquidity at the bounds of the range seek to cover their positions to mitigate losses, thus magnifying the move out of the range, and
    • the move out of the range attracts other investor interest; these investors, due to the behavioural bias of herding (e.g., Herd Behavior) build positions which favor continuation of the trend.

Using this hypothesis, let's start coding.

Compute the Highs and Lows in a Window¶

You'll use the price highs and lows as an indicator for the breakout strategy. In this section, implement get_high_lows_lookback to get the maximum high price and minimum low price over a window of days. The variable lookback_days contains the number of days to look in the past. Make sure this doesn't include the current day.

In [30]:
def get_high_lows_lookback(high, low, lookback_days):
    """
    Get the highs and lows in a lookback window.
    
    Parameters
    ----------
    high : DataFrame
        High price for each ticker and date
    low : DataFrame
        Low price for each ticker and date
    lookback_days : int
        The number of days to look back
    
    Returns
    -------
    lookback_high : DataFrame
        Lookback high price for each ticker and date
    lookback_low : DataFrame
        Lookback low price for each ticker and date
    """
    lookback_high = high.rolling(lookback_days).max().shift(1)
    lookback_low = low.rolling(lookback_days).min().shift(1)
    return lookback_high, lookback_low

project_tests.test_get_high_lows_lookback(get_high_lows_lookback)
Tests Passed
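
To see why the shift is needed to exclude the current day, here is a small sketch on made-up numbers. The ticker 'AAA' and its prices are hypothetical; only the rolling-then-shift pattern mirrors the implementation above.

```python
import numpy as np
import pandas as pd

# Made-up daily highs for one hypothetical ticker.
toy_high = pd.DataFrame({'AAA': [10.0, 12.0, 11.0, 15.0, 13.0]})
lookback_days = 2

# Rolling max over the window, then shift by one row so that each date
# only sees the window that ended on the previous day.
toy_lookback_high = toy_high.rolling(lookback_days).max().shift(1)

# The first two entries are NaN, followed by 12.0, 12.0, 15.0.
print(toy_lookback_high['AAA'].tolist())
```

On the fourth day the high is 15.0, but the lookback high is still 12.0, because only the two preceding days (12.0 and 11.0) are considered.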

View Data¶

Let's use your implementation of get_high_lows_lookback to get the highs and lows for the past 50 days and compare them to their respective stock. Just like last time, we'll use Apple's stock as the example to look at.

In [31]:
lookback_days = 50
lookback_high, lookback_low = get_high_lows_lookback(high, low, lookback_days)
project_helper.plot_high_low(
    close[apple_ticker],
    lookback_high[apple_ticker],
    lookback_low[apple_ticker],
    'High and Low of {} Stock'.format(apple_ticker))

Compute Long and Short Signals¶

Using the generated indicator of highs and lows, create long and short signals using a breakout strategy. Implement get_long_short to generate the following signals:

Signal    Condition
-1        Low > Close Price
1         High < Close Price
0         Otherwise

In this table, Close Price is the close parameter. Low and High are the values generated from get_high_lows_lookback, passed in as the lookback_low and lookback_high parameters.

In [32]:
def get_long_short(close, lookback_high, lookback_low):
    """
    Generate the signals long, short, and do nothing.
    
    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookback_high : DataFrame
        Lookback high price for each ticker and date
    lookback_low : DataFrame
        Lookback low price for each ticker and date
    
    Returns
    -------
    long_short : DataFrame
        The long, short, and do nothing signals for each ticker and date
    """
    # +1 (long) where the close is above the lookback high,
    # -1 (short) where the close is below the lookback low, 0 otherwise.
    # Comparisons against NaN are False, so dates without a full
    # lookback window get a 0 signal.
    long_signal = (close > lookback_high).astype(int)
    short_signal = (close < lookback_low).astype(int)
    return long_signal - short_signal

project_tests.test_get_long_short(get_long_short)
Tests Passed
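
The signal table can be checked on a few made-up numbers. This is only an illustrative sketch: the ticker 'AAA', the prices, and the lookback values are hypothetical, and the expression is one equivalent way of writing the rule.

```python
import numpy as np
import pandas as pd

# Made-up close prices and lookback bounds for one hypothetical ticker.
toy_close = pd.DataFrame({'AAA': [10.0, 13.0, 9.0, 11.0]})
toy_lookback_high = pd.DataFrame({'AAA': [np.nan, 12.0, 12.0, 12.0]})
toy_lookback_low = pd.DataFrame({'AAA': [np.nan, 10.0, 10.0, 10.0]})

# +1 above the lookback high, -1 below the lookback low, 0 otherwise.
toy_signal = ((toy_close > toy_lookback_high).astype(int)
              - (toy_close < toy_lookback_low).astype(int))
print(toy_signal['AAA'].tolist())  # [0, 1, -1, 0]
```

The first day has no lookback window yet, so both comparisons are False and the signal is 0.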

View Data¶

Let's compare the signals you generated against the close prices. This chart will show a lot of signals. Too many in fact. We'll talk about filtering the redundant signals in the next problem.

In [33]:
signal = get_long_short(close, lookback_high, lookback_low)
project_helper.plot_signal(
    close[apple_ticker],
    signal[apple_ticker],
    'Long and Short of {} Stock'.format(apple_ticker))

Filter Signals¶

That was a lot of repeated signals! If we're already shorting a stock, having an additional signal to short a stock isn't helpful for this strategy. This also applies to additional long signals when the last signal was long.

Implement filter_signals to filter out repeated long or short signals within the lookahead_days. If the previous signal was the same, change the signal to 0 (do nothing signal). For example, say you have a single stock time series that is

[1, 0, 1, 0, 1, 0, -1, -1]

Running filter_signals with a lookahead of 3 days should turn those signals into

[1, 0, 0, 0, 1, 0, -1, 0]

To help you implement the function, we have provided you with the clear_signals function. This will remove all signals within a window after the last signal. For example, say you're using a window size of 3 with clear_signals. It would turn the Series of long signals

[0, 1, 0, 0, 1, 1, 0, 1, 0]

into

[0, 1, 0, 0, 0, 1, 0, 0, 0]

clear_signals only takes a Series of the same type of signals, where 1 is the signal and 0 is no signal. It can't take a mix of long and short signals. Using this function, implement filter_signals.

For implementing filter_signals, we don't recommend you try to find a vectorized solution. Instead, you should iterate over the DataFrame one column at a time.

In [34]:
def clear_signals(signals, window_size):
    """
    Clear out signals in a Series of just long or short signals.
    
    Reduce the signals to at most one within each window of window_size days.
    
    Parameters
    ----------
    signals : Pandas Series
        The long, short, or do nothing signals
    window_size : int
        The number of days to have a single signal       
    
    Returns
    -------
    signals : Pandas Series
        Signals with the repeated signals inside each window removed
    """
    # Start with buffer of window size
    # This handles the edge case of calculating past_signal in the beginning
    clean_signals = [0]*window_size
    
    for signal_i, current_signal in enumerate(signals):
        # Check if there was a signal in the past window_size of days
        has_past_signal = bool(sum(clean_signals[signal_i:signal_i+window_size]))
        # Use the current signal if there's no past signal, else 0/False
        clean_signals.append(not has_past_signal and current_signal)
        
    # Remove buffer
    clean_signals = clean_signals[window_size:]

    # Return the signals as a Series of Ints
    return pd.Series(np.array(clean_signals).astype(int), signals.index)


def filter_signals(signal, lookahead_days):
    """
    Filter out signals in a DataFrame.
    
    Parameters
    ----------
    signal : DataFrame
        The long, short, and do nothing signals for each ticker and date
    lookahead_days : int
        The number of days to look ahead
    
    Returns
    -------
    filtered_signal : DataFrame
        The filtered long, short, and do nothing signals for each ticker and date
    """
    filtered_signal = signal.copy()
    filtered_signal[:] = 0
    # items() is used instead of iteritems(), which was removed in pandas 2.0.
    for label, content in signal.items():
        # Substitute 0 for -1 to keep only the long signals, then apply clear_signals.
        pos_signal = content.replace(-1, 0)
        pos_signal_cleared = clear_signals(pos_signal, lookahead_days)
        
        # Substitute 0 for 1 to keep only the short signals, then apply clear_signals.
        neg_signal = content.replace(1, 0)
        neg_signal_cleared = clear_signals(neg_signal, lookahead_days)
        
        # A date is never both long and short, so the sum recombines the two.
        filtered_signal[label] = pos_signal_cleared + neg_signal_cleared
    
    return filtered_signal

project_tests.test_filter_signals(filter_signals)
Tests Passed
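
The worked example from the problem statement can be reproduced with a compact, self-contained sketch. The helper clear_repeats below is a hypothetical stand-in for the provided clear_signals (it is not the project's function), written only so that this snippet runs on its own.

```python
import pandas as pd


def clear_repeats(signals, window_size):
    """Keep a signal only if no signal was kept in the previous window_size days."""
    kept = []
    for value in signals:
        # Signals already kept during the previous window_size days.
        recent = kept[max(0, len(kept) - window_size):]
        kept.append(0 if any(recent) else int(value))
    return pd.Series(kept, index=signals.index)


toy = pd.Series([1, 0, 1, 0, 1, 0, -1, -1])

# Split into long-only and short-only series, clear each, then recombine.
longs = clear_repeats(toy.replace(-1, 0), 3)
shorts = clear_repeats(toy.replace(1, 0), 3)
filtered = longs + shorts
print(filtered.tolist())  # [1, 0, 0, 0, 1, 0, -1, 0]
```

Splitting by direction first matters: a short signal right after a long one is kept, because the two series are cleared independently.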

View Data¶

Let's view the same chart as before, but with the redundant signals removed.

In [35]:
signal_5 = filter_signals(signal, 5)
signal_10 = filter_signals(signal, 10)
signal_20 = filter_signals(signal, 20)
for signal_data, signal_days in [(signal_5, 5), (signal_10, 10), (signal_20, 20)]:
    project_helper.plot_signal(
        close[apple_ticker],
        signal_data[apple_ticker],
        'Long and Short of {} Stock with {} day signal window'.format(apple_ticker, signal_days))

Lookahead Close Prices¶

With the trading signal done, we can start working on evaluating how many days to short or long the stocks. In this problem, implement get_lookahead_prices to get the close price days ahead in time. You can get the number of days from the variable lookahead_days. We'll use the lookahead prices to calculate future returns in another problem.

In [36]:
def get_lookahead_prices(close, lookahead_days):
    """
    Get the lookahead prices for `lookahead_days` number of days.
    
    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookahead_days : int
        The number of days to look ahead
    
    Returns
    -------
    lookahead_prices : DataFrame
        The lookahead prices for each ticker and date
    """
    lookahead_prices = close.shift(-lookahead_days)
    
    return lookahead_prices

project_tests.test_get_lookahead_prices(get_lookahead_prices)
Tests Passed
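
A negative shift is what moves future prices back onto the current date. The following sketch uses made-up prices for a hypothetical ticker 'AAA' and a 2-day lookahead.

```python
import numpy as np
import pandas as pd

# Made-up close prices for one hypothetical ticker.
toy_close = pd.DataFrame({'AAA': [10.0, 11.0, 12.0, 13.0]})

# Each row now holds the close price two rows (days) ahead.
toy_lookahead = toy_close.shift(-2)

# The result is 12.0, 13.0, followed by two NaN values.
print(toy_lookahead['AAA'].tolist())
```

The last lookahead_days rows are NaN because no future price exists for them yet.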

View Data¶

Using the get_lookahead_prices function, let's generate lookahead closing prices for 5, 10, and 20 days.

Let's also chart a subsection of a few months of the Apple stock instead of years. This will allow you to view the differences between the 5, 10, and 20 day lookaheads. Otherwise, they will mesh together when looking at a chart that is zoomed out.

In [37]:
lookahead_5 = get_lookahead_prices(close, 5)
lookahead_10 = get_lookahead_prices(close, 10)
lookahead_20 = get_lookahead_prices(close, 20)
project_helper.plot_lookahead_prices(
    close[apple_ticker].iloc[150:250],
    [
        (lookahead_5[apple_ticker].iloc[150:250], 5),
        (lookahead_10[apple_ticker].iloc[150:250], 10),
        (lookahead_20[apple_ticker].iloc[150:250], 20)],
    '5, 10, and 20 day Lookahead Prices for Slice of {} Stock'.format(apple_ticker))

Lookahead Price Returns¶

Implement get_return_lookahead to generate the log price return between the closing price and the lookahead price.

In [38]:
def get_return_lookahead(close, lookahead_prices):
    """
    Calculate the log returns from the lookahead days to the signal day.
    
    Parameters
    ----------
    close : DataFrame
        Close price for each ticker and date
    lookahead_prices : DataFrame
        The lookahead prices for each ticker and date
    
    Returns
    -------
    lookahead_returns : DataFrame
        The lookahead log returns for each ticker and date
    """
    lookahead_returns = np.log(lookahead_prices / close)
    
    return lookahead_returns

project_tests.test_get_return_lookahead(get_return_lookahead)
Tests Passed
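
As a small numeric illustration of the log return, the sketch below uses made-up prices for a hypothetical ticker 'AAA' that rises 10% on each of two days. It also shows one reason log returns are convenient: they add up across periods.

```python
import numpy as np
import pandas as pd

# Made-up close prices: +10% on each of two consecutive days.
toy_close = pd.DataFrame({'AAA': [100.0, 110.0, 121.0]})
toy_lookahead = toy_close.shift(-1)

# One-day lookahead log returns; the last row is NaN (no future price).
toy_log_return = np.log(toy_lookahead / toy_close)

# The two one-day log returns sum to the two-day log return log(121 / 100).
two_day_log_return = np.log(121.0 / 100.0)
print(toy_log_return['AAA'].iloc[:2].sum(), two_day_log_return)
```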

View Data¶

Using the same lookahead prices and same subsection of the Apple stock from the previous problem, we'll view the lookahead returns.

In order to view price returns on the same chart as the stock, a second y-axis will be added. When viewing this chart, the axis for the price of the stock will be on the left side, like previous charts. The axis for price returns will be located on the right side.

In [39]:
price_return_5 = get_return_lookahead(close, lookahead_5)
price_return_10 = get_return_lookahead(close, lookahead_10)
price_return_20 = get_return_lookahead(close, lookahead_20)
project_helper.plot_price_returns(
    close[apple_ticker].iloc[150:250],
    [
        (price_return_5[apple_ticker].iloc[150:250], 5),
        (price_return_10[apple_ticker].iloc[150:250], 10),
        (price_return_20[apple_ticker].iloc[150:250], 20)],
    '5, 10, and 20 day Lookahead Returns for Slice of {} Stock'.format(apple_ticker))

Compute the Signal Return¶

Using the price returns, generate the signal returns: the return earned on each date by following that date's signal (long, short, or do nothing).

In [40]:
def get_signal_return(signal, lookahead_returns):
    """
    Compute the signal returns.
    
    Parameters
    ----------
    signal : DataFrame
        The long, short, and do nothing signals for each ticker and date
    lookahead_returns : DataFrame
        The lookahead log returns for each ticker and date
    
    Returns
    -------
    signal_return : DataFrame
        Signal returns for each ticker and date
    """
    signal_return = signal * lookahead_returns
    
    return signal_return

project_tests.test_get_signal_return(get_signal_return)
Tests Passed
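
The effect of multiplying signals by returns is easiest to see on made-up values. In this sketch the ticker 'AAA', the signals, and the returns are all hypothetical.

```python
import pandas as pd

# Made-up signals and lookahead log returns for one hypothetical ticker.
toy_signal = pd.DataFrame({'AAA': [1, -1, 0, -1]})
toy_returns = pd.DataFrame({'AAA': [0.02, -0.03, 0.05, 0.01]})

# A long earns the return, a short earns its negative, and 0 earns nothing.
toy_signal_return = toy_signal * toy_returns
print(toy_signal_return['AAA'].tolist())  # [0.02, 0.03, 0.0, -0.01]
```

The second entry is positive because a short position profits when the price falls; the last entry is negative because the short was wrong about the direction.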

View Data¶

Let's continue using the previous lookahead prices to view the signal returns. Just like before, the axis for the signal returns is on the right side of the chart.

In [41]:
title_string = '{} day Lookahead Signal Returns for {} Stock'
signal_return_5 = get_signal_return(signal_5, price_return_5)
signal_return_10 = get_signal_return(signal_10, price_return_10)
signal_return_20 = get_signal_return(signal_20, price_return_20)
project_helper.plot_signal_returns(
    close[apple_ticker],
    [
        (signal_return_5[apple_ticker], signal_5[apple_ticker], 5),
        (signal_return_10[apple_ticker], signal_10[apple_ticker], 10),
        (signal_return_20[apple_ticker], signal_20[apple_ticker], 20)],
    [title_string.format(5, apple_ticker), title_string.format(10, apple_ticker), title_string.format(20, apple_ticker)])

Test for Significance¶

Histogram¶

Let's plot a histogram of the signal return values.

In [42]:
project_helper.plot_signal_histograms(
    [signal_return_5, signal_return_10, signal_return_20],
    'Signal Return',
    ('5 Days', '10 Days', '20 Days'))

Question: What do the histograms tell you about the signal returns?¶

To answer this question think about the following:

  • Are the histograms skewed? If yes, which side?
  • Is the skewing because of outliers? What could be the reason for this?
  • Which side do the outliers appear and what does it signify?
  • What will happen to the distribution of the histograms if the outliers are removed?

Another way to check if a distribution is normal is to create a QQ (quantile-quantile) plot. See this link to learn more about QQ plots.

#TODO: Put Answer In this Cell
The signal return distributions are close to normal, but they are not exactly normal, because they have outliers both near the mean and in the tails (that's very clear for the 10-day and 20-day returns).
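
The QQ-plot check mentioned in the question can be sketched numerically with scipy.stats.probplot, which returns the QQ points together with a straight-line fit. The data below are made up (they are not the project's signal returns), and using the fit's correlation coefficient as a rough normality indicator is an assumption of this sketch, not a formal test.

```python
import numpy as np
from scipy import stats

# Made-up sample: evenly spaced normal quantiles, i.e. very close to normal.
probs = (np.arange(1, 201) - 0.5) / 200
normal_like = stats.norm.ppf(probs)

# The same sample with a few extreme values appended to mimic tail outliers.
with_outliers = np.concatenate([normal_like, [8.0, 9.0, -8.0, -9.0]])

# probplot returns (theoretical quantiles, ordered values) and
# (slope, intercept, r); r near 1 means the QQ points are nearly a line.
_, (_, _, r_normal) = stats.probplot(normal_like, dist='norm')
_, (_, _, r_outliers) = stats.probplot(with_outliers, dist='norm')
print(r_normal, r_outliers)
```

Tail outliers bend the ends of the QQ plot away from the line, which shows up here as a lower correlation for the second sample.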

Outliers¶

You might have noticed the outliers in the 10 and 20 day histograms. To better visualize the outliers, let's compare the 5, 10, and 20 day signal returns to normal distributions with the same mean and standard deviation as each signal return distribution.

In [43]:
project_helper.plot_signal_to_normal_histograms(
    [signal_return_5, signal_return_10, signal_return_20],
    'Signal Return',
    ('5 Days', '10 Days', '20 Days'))

Kolmogorov-Smirnov Test¶

While you can see the outliers in the histogram, we need to find the stocks that are causing these outlying returns. We'll use the Kolmogorov-Smirnov Test or KS-Test. This test will be applied to each ticker's signal returns where a long or short signal exists.

In [44]:
# Filter out returns that don't have a long or short signal.
long_short_signal_returns_5 = signal_return_5[signal_5 != 0].stack()
long_short_signal_returns_10 = signal_return_10[signal_10 != 0].stack()
long_short_signal_returns_20 = signal_return_20[signal_20 != 0].stack()

# Get just ticker and signal return
long_short_signal_returns_5 = long_short_signal_returns_5.reset_index().iloc[:, [1,2]]
long_short_signal_returns_5.columns = ['ticker', 'signal_return']
long_short_signal_returns_10 = long_short_signal_returns_10.reset_index().iloc[:, [1,2]]
long_short_signal_returns_10.columns = ['ticker', 'signal_return']
long_short_signal_returns_20 = long_short_signal_returns_20.reset_index().iloc[:, [1,2]]
long_short_signal_returns_20.columns = ['ticker', 'signal_return']

# View some of the data
long_short_signal_returns_5
Out[44]:
ticker signal_return
0 A 0.00732604
1 ABC 0.01639650
2 ADP 0.00981520
3 AGENEN 0.02359321
4 AKAM 0.04400495
... ... ...
22630 MSI -0.02166808
22631 NOV -0.04026558
22632 SYLVES 0.04115905
22633 SYY -0.00817960
22634 WDC -0.04679319

22635 rows × 2 columns

This gives you the data to use in the KS-Test.

Now it's time to implement the function calculate_kstest, which uses the Kolmogorov-Smirnov test (KS test) to compare the distribution of all stock signal returns (the input DataFrame in this case) with each stock's signal returns. Run the KS test on each stock's signal returns against a normal distribution. Use scipy.stats.kstest to perform the KS test. When calculating the standard deviation of the signal returns, make sure to set the delta degrees of freedom to 0.

For this function, we don't recommend you try to find a vectorized solution. Instead, you should iterate over the groupby function.

Hint: You should compare the signal returns of each individual ticker against a normal distribution whose parameters (mean and standard deviation) are computed from all tickers.

In [45]:
from scipy.stats import kstest


def calculate_kstest(long_short_signal_returns):
    """
    Calculate the KS-Test against the signal returns with a long or short signal.
    
    Parameters
    ----------
    long_short_signal_returns : DataFrame
        The signal returns which have a signal.
        This DataFrame contains two columns, "ticker" and "signal_return"
    
    Returns
    -------
    ks_values : Pandas Series
        KS statistic for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    """
    #Initialize empty list for tickers, ks_values, and p_values.
    tickers_l = []
    ks_l = []
    p_l = []
    
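    # Compute the population mean and std (ddof=0) over all tickers;
    # they parameterize the reference normal distribution for the KS test
    pop_mean = np.mean(long_short_signal_returns['signal_return'])
    pop_std = np.std(long_short_signal_returns['signal_return'], ddof=0)
    args = (pop_mean, pop_std)
    
    # Group by the column name rather than a one-element list, so each key is
    # the ticker string itself (pandas 2.0+ yields 1-tuples for list keys)
    for ticker, group in long_short_signal_returns.groupby('ticker'):
        ks, p = kstest(group['signal_return'], 'norm', args=args)
        tickers_l.append(ticker)
        ks_l.append(ks)
        p_l.append(p)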
    
    ks_values = pd.Series(data=ks_l, index=tickers_l)
    p_values = pd.Series(data=p_l, index=tickers_l)
    return ks_values, p_values

#project_tests.test_calculate_kstest(calculate_kstest)
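
To make the per-ticker KS test more concrete, here is a self-contained sketch on made-up data. The tickers 'AAA', 'BBB', and 'CCC' and their returns are hypothetical; the only assumption carried over from the text is that the reference normal distribution uses the mean and population standard deviation (ddof=0) of all returns.

```python
import numpy as np
import pandas as pd
from scipy.stats import kstest, norm

# Made-up returns: two tickers with normal-looking returns and one
# hypothetical ticker ('CCC') whose few returns are all extreme.
probs = (np.arange(1, 101) - 0.5) / 100
normal_like = 0.02 * norm.ppf(probs)
toy_returns = pd.DataFrame({
    'ticker': ['AAA'] * 100 + ['BBB'] * 100 + ['CCC'] * 4,
    'signal_return': np.concatenate([normal_like, normal_like, [0.10] * 4]),
})

# Reference normal: mean and population std (ddof=0) of ALL returns.
pop_mean = toy_returns['signal_return'].mean()
pop_std = toy_returns['signal_return'].std(ddof=0)

toy_ks, toy_p = {}, {}
for ticker, group in toy_returns.groupby('ticker'):
    statistic, p_value = kstest(group['signal_return'], 'norm',
                                args=(pop_mean, pop_std))
    toy_ks[ticker], toy_p[ticker] = statistic, p_value

print(pd.Series(toy_ks))
print(pd.Series(toy_p))
```

In this toy data the extreme ticker gets a KS statistic close to 1 with a very small p-value, while the normal-looking tickers get a small statistic and a p-value well above 0.05.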

View Data¶

Using the signal returns we created above, let's calculate the ks and p values.

In [46]:
ks_values_5, p_values_5 = calculate_kstest(long_short_signal_returns_5)
ks_values_10, p_values_10 = calculate_kstest(long_short_signal_returns_10)
ks_values_20, p_values_20 = calculate_kstest(long_short_signal_returns_20)

print('ks_values_5')
print(ks_values_5.head(10))
print('p_values_5')
print(p_values_5.head(10))
ks_values_5
A      0.17231321
AAL    0.10736810
AAP    0.19713963
AAPL   0.15567361
ABBV   0.16834393
ABC    0.21421483
ABT    0.21390867
ACN    0.28237998
ADBE   0.24284166
ADI    0.19444645
dtype: float64
p_values_5
A      0.18610961
AAL    0.69187425
AAP    0.04471885
AAPL   0.24661445
ABBV   0.24537275
ABC    0.02722741
ABT    0.04800357
ACN    0.00581580
ADBE   0.00905889
ADI    0.09839965
dtype: float64

Find Outliers¶

With the ks and p values calculated, let's find which symbols are the outliers. Implement the find_outliers function to find the following outliers:

  • Symbols that reject the null hypothesis (p-value less than pvalue_threshold) AND have a KS value above ks_threshold.

Note: your function should return symbols that meet both requirements above.

In [47]:
def find_outliers(ks_values, p_values, ks_threshold, pvalue_threshold=0.05):
    """
    Find outlying symbols using KS values and P-values
    
    Parameters
    ----------
    ks_values : Pandas Series
        KS statistic for all the tickers
    p_values : Pandas Series
        P value for all the tickers
    ks_threshold : float
        The threshold for the KS statistic
    pvalue_threshold : float
        The threshold for the p-value
    
    Returns
    -------
    outliers : set of str
        Symbols that are outliers
    """
    # Collect the outliers in a set
    outliers = set()
    # Loop over tickers and values in the ks_values Series
    # (items() is used because iteritems() was removed in pandas 2.0)
    for ticker, ks_value in ks_values.items():
        # A ticker is an outlier only if BOTH conditions hold: its p-value is
        # below pvalue_threshold AND its KS value is above ks_threshold
        if (p_values[ticker] < pvalue_threshold) and (ks_value > ks_threshold):
            outliers.add(ticker)
    return outliers


project_tests.test_find_outliers(find_outliers)
Tests Passed
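
The two-condition rule can also be written with boolean masks. The sketch below is an alternative formulation on made-up values (the tickers, KS values, and p-values are hypothetical), not the loop used in the implementation above.

```python
import pandas as pd

# Made-up KS statistics and p-values for three hypothetical tickers.
toy_ks = pd.Series({'AAA': 0.10, 'BBB': 0.45, 'CCC': 0.60})
toy_p = pd.Series({'AAA': 0.50, 'BBB': 0.20, 'CCC': 0.01})

# Combine the masks with & so that a ticker must meet BOTH conditions.
mask = (toy_p < 0.05) & (toy_ks > 0.4)
toy_outliers = set(toy_ks[mask].index)
print(toy_outliers)  # {'CCC'}
```

'BBB' has a KS value above the threshold but a p-value that is not below 0.05, so it is not flagged.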

View Data¶

Using the find_outliers function you implemented, let's see what we found.

In [48]:
ks_threshold = 0.4
outliers_5 = find_outliers(ks_values_5, p_values_5, ks_threshold)
outliers_10 = find_outliers(ks_values_10, p_values_10, ks_threshold)
outliers_20 = find_outliers(ks_values_20, p_values_20, ks_threshold)

outlier_tickers = outliers_5.union(outliers_10).union(outliers_20)
print('{} Outliers Found:\n{}'.format(len(outlier_tickers), ', '.join(list(outlier_tickers))))
27 Outliers Found:
KOLPAK, PULCHE, TARDA, HPE, SAXATI, URUMIE, SYLVES, SCHREN, GESNER, GREIGI, KHC, KAUFMA, VVEDEN, CLUSIA, BIFLOR, K, DASYST, ALTAIC, PRAEST, HUMILI, ORPHAN, AGENEN, TURKES, LINIFO, BAKERI, ARMENA, SPRENG

Show Significance without Outliers¶

Let's compare the 5, 10, and 20 day signal returns without outliers to normal distributions. Also, let's see how the P-Value has changed with the outliers removed.

In [49]:
good_tickers = list(set(close.columns) - outlier_tickers)

project_helper.plot_signal_to_normal_histograms(
    [signal_return_5[good_tickers], signal_return_10[good_tickers], signal_return_20[good_tickers]],
    'Signal Return Without Outliers',
    ('5 Days', '10 Days', '20 Days'))

That's more like it! The returns are closer to a normal distribution. You have finished the research phase of a Breakout Strategy. You can now submit your project.